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Abstract 

The error threshold transition in a stochastic (i.e. finite population) version of 
the quasispecies model of molecular evolution is studied using finite-size scaling. For 
the single-sharp-peak replication landscape, the deterministic model exhibits a first- 
order transition at Q — Q c — l/a, where Q is the probability of exact replication of 
a molecule of length L — > oo, and a is the selective advantage of the master string. 
For sufficiently large population size, N, we show that in the critical region the 
characteristic time for the vanishing of the master strings from the population is 
described very well by the scaling assumption r = N 1 / 2 f a UQ — Q c ) TV 1 / 2 ] , where 
f a is an a-dependent scaling function. 



Short Title: error threshold in finite populations 
PACS: 87.10+e, 64.60.Cn 
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An elusive issue in the extension of Eigen's quasispecies model Q of molecular 
evolution to finite populations is the characterization of the so-called error threshold 
phenomenon which limits the length of the molecules and, consequently, the amount 
of information they can store 0. This phenomenon poses an interesting challenge 
to the theories of the origin of life, since it prevents the emergence of huge molecules 
which could carry the necessary information for building a complex metabolism. 
Moreover, since modern theories of integration of information in pre-biotic systems 
involve the compartmentation of a small number of molecules, the understanding of 
the effects of the error propagation in finite populations has become a major issue 
to the theories of the origin of life (I . 

The quasispecies model was originally formulated within a deterministic chem- 
ical kinetic framework based on a set of ordinary differential equations for the 
concentrations of the different types of molecules that compose the population. 
Such formulation, however, is valid only in the limit where the total number of 
molecules, denoted by N, goes to infinity. In the binary version of the quasispecies 
model, a molecule is represented by a string of L digits (si, S2, . . . , si), with the 
variables s a allowed to take on only two different values, say s a = 0, 1, each of 
which representing a different type of monomer used to build the molecule. The 
concentrations Xi of molecules of type i — 1,2, ... ,2 evolve in time according to 
the equations |l], ||] 

d ^=Y,W ij x j -${t)x i , (I) 

j 

where $(£) is a dilution flux that keeps the total concentration constant. This flux 
introduces a nonlinearity in Eq. (^) , and is determined by the condition dxi /dt — 
0. In particular, assuming ^\ x i — 1 yields 

* = J2w ijXj . (2) 

The elements of the replication matrix Wij depend on the replication rate or fitness 
Ai of the strings of type i as well as on the Hamming distance d(i, j) between strings 
i and j. They are given by jl], ^| 

W u = Ai q" (3) 

and 

W ij =A j q L - d W(l- q f {i 'V i^j, (4) 

where < q < 1 is the single-digit replication accuracy, which is assumed to be the 
same for all digits. 

The quasispecies concept and the error threshold phenomenon are illustrated 
more neatly for the single-sharp-peak replication landscape, in which we ascribe 
the replication rate a > 1 to the so-called master string, say (1,1,. ...I), and the 
replication rate 1 to the remaining strings. In this context, the parameter a is 
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termed selective advantage of the master string. As the replication accuracy q 
decreases, two distinct regimes are observed in the population composition in the 
deterministic case: the quasispecies regime characterized by the presence of the 
master string together with its close neighbors, and the uniform regime where the 
2 L strings appear in the same proportion. The transition between these regimes 
takes place at the error threshold q c , whose value depends on the parameters L and 
a ||]. However, even in the deterministic case, N — > oo, a genuine thermodynamic 
order-disorder phase transition will occur in the limit L — ► oo only @, ^, §|- To 
study this transition for large L, it is convenient to introduce the probability of 
exact replication of an entire string, namely, 



for L — > oo jl], [| . A recent finite-size scaling study of the sharpness of the threshold 
transition indicates that the characteristics of the transition persist across a range 
of Q of order L^ 1 about Q c @. 

Although several theoretical frameworks have been proposed to generalize the 
deterministic kinetic formulation of the quasispecies model so as to take into account 
the effect of finite N || || [h], [ll], [l^, , the somewhat uncontrolled approximations 
used in those analyses have hindered a precise characterization of the error threshold 
for finite populations. In particular, Nowak and Schuster [jl0| employed a simple 
birth and death model, whose deterministic limit, however, does not yield the 
stationary distribution predicted by Eq. ([!]), as well as numerical simulations based 
on Gillespie's algorithm [Q to show that an appropriately defined Q C {N) tends 
to the deterministic value l/a with N~ x / 2 for sufficiently large populations. A 
similar result was obtained by neglecting the possibility of occurrence of multiple 
errors during the replication of a molecule ]l2| ■ A more drastic approximation that 
neglects linkage disequilibrium at the population level yields that Q C (N) increases 
linearly with 1/N |13| . Of course, since there is no generally accepted definition 
of error threshold for finite N (and for finite L as well), denoted above by Q C (N), 
there are some arbitrariness in those analyses. 

In this paper we follow a more direct approach to characterize the error thresh- 
old transition for finite N, which dispenses with a definition for Q C (N). As men- 
tioned before, since a genuine phase transition occurs in the limits N — > oo and 
L — > oo only, we study a stochastic (i.e. finite N) version of the quasispecies model 
with L — > oo and q — > 1 so that Q = q L is finite. In this limit the problem 
simplifies enormously as the probability of any string becoming a master string 
due to replication errors is of order 1/L and so can be safely neglected. Besides, 
since for the single-sharp-peak replication landscape the strings can be classified 
in two types only: the master strings and the error tail, which comprises all other 



Q = q L 



(5) 



so that the discontinuous transition occurs at 




(6) 
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strings, the population at any given generation can be described by the single in- 
teger n = 0, 1, . . . , N, which gives the number of master strings in the population. 
The goal then is to calculate the probability distribution that at generation t there 
are exactly n master strings in the population. This quantity, denoted by Vt(n), 
obeys the recursion equation 

N 

V t+ i(n) = Y,T(n,m)Vt(m) (7) 

m=0 

with the elements of the transition matrix T given by 

T(n,m) = jr( N k ) ( k n ) w k m (l-w m f- k Q n (l-Q) k - n , (8) 

k=n ^ / V / 

where 

ma 



■ v (9) 
iv — m + ma 

is the relative fitness of the master strings. In writing Eq. (||) we have followed 
the prescription used in the implementation of the standard genetic algorithm p"5| : 
first the natural selection process acting via differential reproduction is considered 
and then the mutation process. We note that J2 n ^( n > m ) = 1 ^ m an d T(0, 0) = 1. 
Moreover the largest eigenvalue of T is Ao = 1 and its corresponding eigenvector 
is lj = (1,0, .. . ,0). This stochastic model is easily recognized as the celebrated 
Kimura-Crow infinite allele model |l6|, |l7j which has been extensively studied within 
the diffusion approximation for large N. However, for arbitrary values of Q and a 
the solutions of the partial differential equations are too complicated to be of any 
utility to our purposes [p7| . 

As for finite N the fluctuations, either in the reproduction or mutation pro- 
cesses, will ultimately lead to the irreversible loss of all copies of the master string 
from the population, the asymptotic solution of Eq. (Q) is simply Vodn) — S n Q. 
Our goal is to determine how the characteristic time, r, that governs the vanishing 
of the master strings from the population depends on N, Q and a. 

Before proceeding on the analysis of the stochastic problem, it is instructive to 
discuss briefly the deterministic limit N — > oo. In this case the average number of 
master strings obeys the recursion equation 

N N 

(n) t +i = ^^JiT^m)?,^) 

= Qa(n) t , (10) 
whose solution is (n) t — (Qa) {n)o. Hence, in the deterministic regime we find 
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which diverges at Q = Q c = 1/a, signalling thus the existence of a phase transition 
in the limit N — > oo. Clearly, for Q > Q c the master strings are always present in 
the population so that r is infinite in this entire region. 

We consider now the finite N regime. In this case the recursion equations for 
the moments of n do not yield useful information since, as usual, the moment of 
order p depends on the moment of order p+ 1 evaluated at the previous generation. 
We resort then to a direct calculation of the probability distribution Vtim). More 
specifically, we will focus on the calculation of 'Pt(O), since this is the quantity that 
measures the rate of vanishing of the master strings from the population. Although 
■Pt(O) could be evaluated through a series of matrix multiplications, a simple linear 
algebra calculation yields fl7f 

N 



Vt{0) = J2 Cnln ° X n 



n=0 

= 1 + cil w \\ + . . . + cnIno^n (12) 

where X n are the eigenvalues of T, l n o are the zeroth components of the eigenvectors 
l n , and c n are parameters that depend on the initial state Vo{n). Also we have used 
Ao = Iqo = Co = 1. Assuming without loss of generality that 1>Ai>...>Aat>0, 
in the limit of large t we find 

1 - P t (0) ^ Ce~ t/T (13) 

where 

(14) 



lnAi 

and C is a constant that depends on the initial state. Thus the problem becomes the 
one of finding the second largest eigenvalue of the nonsymmetric matrix T. Since 
the largest eigenvalue and its corresponding eigenvector are already known, this 
numerical problem yields easily to the vector iteration method |T§[ ]. Alternatively, 
we could find r by following the time evolution of Vt (0) , obtained directly through 
the recursion equations (0), for a few generations and then plotting ln [1 — Pt(0)} 
against the generation number t. We have verified that both methods yield the 
same results for r. 

In Fig. 1 we present the dependence of ln r on the probability of exact replication 
of an entire string, Q, for a = 2 and several values of N. The finite N effects are 
negligible for values of Q smaller than, though not too close, Q c , as indicated by 
the very good agreement between the finite data and the theoretical prediction 



for N — > oo given in Eq. (11). Since we expect r to increase exponentially with 
increasing N for Q > Q c , and to tend towards its limiting value, Eq. (pi]), also 
exponentially with N for Q < Q c , the issue is then to determine the dependence 
of r on N at the critical point Q = Q c . In Fig. 2 we present lnr calculated at 
Q c = 1/a against InTV for different values of a. These results indicate clearly that 
at the critical point r increases like N 1 / 2 , irrespective of the value of a. Once 
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we have identified the rescaling of In r that leads to the collapsing of the data for 
different N at Q = Q c , the next step is to determine the sharpness of the transition, 
namely, the range of Q about Q c where the transition characteristics persists. This 
is achieved by assuming that the size of this region shrinks to zero like N~ 1 / 1 ' as 
./V — > oo, where the exponent v > is estimated using hnite-size scaling or, more 
precisely, the data collapsing method ]l9| ]. In Fig. 3 we show the collapse of the 
data for different N obtained with ^ = 2 for a = 2, 10 and 50. Although for a = 2 
we can achieve a good-quality data collapse using relatively small population sizes 
(N > 200), for larger values of a, however, a similar quality collapsing can only 
be obtained using larger values of N (i.e. N > 400). In summary, the results of 
the data collapsing method indicate that the dependence of r on N in the critical 
region is very well described by the scaling assumption 

T = iV 1/2 /a [(Q - Qc) ^ 1/2 ] , (15) 

where f a is a scaling function, whose specific form depends on the parameter a. 

To appreciate the effect of the selective advantage parameter a on the quality 
of the data collapsing results presented in Fig. 3, next we consider in some detail 
the case a — ► oo and N finite. Using w m — > 1 for m > yields 

T{n,m) =T(n) = (^) Q" (1 - Q)^" m > 0. (16) 



As before, T(0, 0) = 1 and T(n, 0) = for n > 0. In this case the eigenvalues of T 
can easily be ca 
Xn = 0. Hence, 



can easily be calculated analytically yielding Ao = 1, Ai = J2n=i Aa = . . . = 



In 



i-(i-Qf 



(17) 



Finally, taking the limits Q — > Q c = and N — > oo, we can easily verify that v = 1 
in this limit. This interesting result suggests that uncontrolled approximations 
and simplifications of the original model which enhance the selective advantage 
of the master string or the finite population sampling effects are expected to give 
unreliable estimates of the exponent v. Moreover, care must be taken in restricting 
the finite-size scaling analysis to the regime N ^> a to avoid underestimating the 
value of v. We note, of course, that the situation of interest is N — > oo while a 
remains finite. 

To conclude, the collapse of the data for different N into a-dependent scaling 
functions presented in Fig. 3 and summarized in the scaling assumption ( |l5| ) provide 
a full characterization of the error threshold transition, signalled in our model by the 
divergence of r at Q c — 1/a, for large N. We emphasize that the main advantage of 
our approach is that it does not rely upon any arbitrary definition of error threshold 
for finite populations. 
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Figure captions 



Fig. 1 Logarithm of the characteristic time for the vanishing of the master strings 
from the population, lnr, as a function of the probability of exact replication, Q, 
for a = 2, and N = 100 (*), 200 (O), 300 (□), 500 (A), and 600 (x). The solid line 
is the theoretical prediction for N — ► oo. 

Fig. 2 Logarithm of the characteristic time for the vanishing of the master strings 
from the population, lnr, calculated at Q c = l/a as a function of the logarithm of 
the population size, In TV, for a = 2 (O), 10 (A), and 50 (□). 

Fig. 3 Properly rescaled logarithm of the characteristic time for the vanishing of the 
master strings from the population, lnr/ln./V 1 / 2 , as a function of (Q — Q c ) N 1 / 2 
for (from top to bottom at Q — Q c ) a = 2, 10 and 50. The convention is N = 200 
(O), 300 (□), 400 (O), 500 (A), and 600 (x). For a = 10 and 50 only the data for 
N > 400 are presented. 
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